In [1]:
    
cat ../data/rose.fa
    
    
In [2]:
    
head ../data/contigs.fasta
    
    
In [3]:
    
zcat ../data/BJ-HSR1_R1.fastq.gz | head
    
    
$Q_{phred} = -10 \log_{10} e$
where:
e - probability of a base being called wrong
How to encode it to text?
$Q_{phred} + 33$
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL................................................. 
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                            104                   126
 0........................26...31.......40
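The encoding above can be tried out directly. The following is a minimal sketch, assuming the Phred+33 offset shown in the diagram and the standard Phred definition $Q = -10 \log_{10} e$; the function names (`phred_scores`, `encode_scores`, `error_probability`) are made up for illustration and do not come from any library.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    Assumes Phred+33 encoding by default (character code minus 33).
    """
    return [ord(ch) - offset for ch in quality_string]


def encode_scores(scores, offset=33):
    """Turn a list of Phred scores back into a quality string."""
    return "".join(chr(q + offset) for q in scores)


def error_probability(q):
    """Invert Q = -10 * log10(e) to get the error probability e."""
    return 10 ** (-q / 10)


# '!' is ASCII 33 (Q = 0) and 'I' is ASCII 73 (Q = 40), as in the diagram.
scores = phred_scores("!+5?I")
for q in scores:
    print(q, error_probability(q))
```

A score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1000.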
@NS500159:12:H2FJ5AFXX:1:11101:12552:1058 1:N:0:1:
@NS500159 - machine id
12 - run number
H2FJ5AFXX - flowcell id
1 - lane
11101 - tile number
12552:1058 - x and y coordinates
1 - read 1 or 2 (for paired ends)
N - filtered (Y) or not (N)
0 - always 0 for HiSeq and NextSeq
1 - sample number from the sample sheet

In a SAM file, header lines start with @ and contain metadata: reference sequence names, lengths, aligner, etc.
Each alignment record contains 11 mandatory fields:
QNAME - query template name (think header from fastq file)
FLAG - bitwise flag (more on it in a moment)
RNAME - reference sequence name (e.g. chr1)
POS - 1-based left-most mapping position
MAPQ - mapping quality (think uniqueness of the mapping)
CIGAR - details of the mapping (match/mismatch/indel/clipping etc.)
RNEXT - reference sequence name for the pair (mate)
PNEXT - mapping position for the pair (mate)
TLEN - template (query) length
SEQ - (aligned) segment sequence (not necessarily entire query sequence)
QUAL - quality, as in fastq

FLAG field
This is possibly the most important field in practical terms.
1     0x1    template having multiple segments in sequencing
2     0x2    each segment properly aligned according to the aligner
4     0x4    segment unmapped
8     0x8    next segment in the template unmapped
16    0x10   SEQ being reverse complemented
32    0x20   SEQ of the next segment in the template being reverse complemented
64    0x40   the first segment in the template
128   0x80   the last segment in the template
256   0x100  secondary alignment
512   0x200  not passing filters, such as platform/vendor quality controls
1024  0x400  PCR or optical duplicate
2048  0x800  supplementary alignment

BAM is the same as SAM but compressed (binary), and therefore not directly readable. Because of its compression efficiency, it is the preferred way of storing alignment data.
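Returning to the FLAG bits listed above: a FLAG value is the sum of the bits that are set, so it can be taken apart with bitwise AND. The sketch below shows one way to do that; the short labels in `FLAG_BITS` and the function `decode_flag` are hypothetical names chosen for this example, not part of the SAM format or of any tool.

```python
# (bit value, short label) pairs; the labels are illustrative only.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_template"),
    (0x80, "last_in_template"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


# 99 = 1 + 2 + 32 + 64
print(decode_flag(99))
# 147 = 1 + 2 + 16 + 128
print(decode_flag(147))
```

The same test works for a single property, e.g. `flag & 0x4` is non-zero exactly when the segment is unmapped.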
You don't usually work with these directly; rather, they are produced as intermediate results that get processed further to yield biologically relevant insights.
These are the result of any alignment to a reference that you perform.
pileup - tab-delimited; records contain aggregate alignment data per reference position
.                    match on the forward strand
,                    match on the reverse strand
ACTGN                mismatch on forward strand
actgn                mismatch on reverse strand
+|-[0-9]ACTGNactgn   insertion | deletion
^                    start of the read segment
$                    end of the read segment
gff (and its variant gtf) - general feature format; tab-delimited plain text
bed - generic position format
vcf - variant call format
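The pileup base symbols listed above can be tallied with a small scanner. This is only a sketch under stated assumptions: it expects a samtools-style bases column, in which `^` is followed by one extra character (the mapping quality) and an indel is written as a sign, a length, and that many bases. The function name `summarize_pileup_bases` and the keys of the returned dictionary are invented for this example.

```python
import re


def summarize_pileup_bases(bases):
    """Count matches, mismatches and indels in one pileup bases string."""
    counts = {
        "match_fwd": 0, "match_rev": 0,
        "mismatch_fwd": 0, "mismatch_rev": 0,
        "insertion": 0, "deletion": 0,
    }
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            # Assumption: '^' is followed by one mapping-quality character.
            i += 2
            continue
        if ch in "+-":
            length_match = re.match(r"\d+", bases[i + 1:])
            if length_match:
                digits = length_match.group()
                counts["insertion" if ch == "+" else "deletion"] += 1
                # Skip the sign, the length digits and the indel bases.
                i += 1 + len(digits) + int(digits)
                continue
        if ch == ".":
            counts["match_fwd"] += 1
        elif ch == ",":
            counts["match_rev"] += 1
        elif ch in "ACGTN":
            counts["mismatch_fwd"] += 1
        elif ch in "acgtn":
            counts["mismatch_rev"] += 1
        # '$' and any other symbol are skipped without counting.
        i += 1
    return counts


print(summarize_pileup_bases("^I.,,A$.+2AG,-1c"))
```

Indel bases are skipped as a unit so that they are not miscounted as mismatches.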
In [ ]: